Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set timeout for helper subprocesses to enhance stability #11125

Merged
merged 11 commits into from
Dec 18, 2024

Conversation

kbukum1
Copy link
Contributor

@kbukum1 kbukum1 commented Dec 13, 2024

What are you trying to accomplish?

This PR introduces a timeout mechanism for running helper subprocesses to ensure they do not run indefinitely, enhancing system stability and robustness. The change is gated behind the feature flag: enable_shared_helpers_command_timeout.


What and Why:

  • Added the capture3_with_timeout method to manage subprocess execution with a configurable timeout.
  • Integrated the timeout mechanism into the run_helper_subprocess method by replacing the existing Open3.capture3 call with capture3_with_timeout.
  • Set the default timeout for helper subprocesses to 120 seconds to prevent indefinite blocking.
    Note: The timeout applies to the time between outputs to stdout or stderr, not the overall execution duration of the command. Each time the command produces output, the timeout is reset, and it waits for the next response.
  • Feature Flag: The timeout mechanism is gated behind the enable_shared_helpers_command_timeout feature flag. This ensures no impact to existing workflows unless the flag is enabled.

Related Issues:

  • Ensures reliability in subprocess management, addressing potential issues with stuck or long-running processes.
  • Allows gradual rollout via the feature flag to validate its impact.

Anything you want to highlight for special attention from reviewers?

  • Feature Flag Implementation: The change is controlled by the enable_shared_helpers_command_timeout feature flag to ensure it can be toggled on/off safely.
  • Exception Handling: Proper exception handling for terminated processes and cleanup mechanisms (Process.kill and Process.wait) to avoid zombie processes.
  • Timeout Behavior: The timeout resets on progressive outputs to accommodate long-running processes that produce regular outputs.

How will you know you've accomplished your goal?

  • Unit tests verifying that subprocesses terminate correctly after exceeding the timeout when the feature flag is enabled.
  • Manual testing with simulated long-running and progressive commands to validate timeout behavior and exception handling.
  • Ensuring all existing functionality dependent on subprocess calls works as expected when the feature flag is disabled.

Checklist

  • I have run the complete test suite to ensure all tests and linters pass.
  • I have thoroughly tested my code changes, including adding additional tests for new functionality and testing with the feature flag enabled/disabled.
  • I have written clear and descriptive commit messages.
  • I have provided a detailed description of the changes in the pull request, including the problem it addresses, how it fixes the problem, and any relevant details about the implementation.
  • I have ensured that the code is well-documented and easy to understand.

@github-actions github-actions bot added L: go:modules Golang modules L: rust:cargo Rust crates via cargo L: python labels Dec 13, 2024
@kbukum1 kbukum1 force-pushed the jurre/helper-sub-process-timeout branch from 77e08d4 to 132fedd Compare December 16, 2024 20:55
@kbukum1 kbukum1 force-pushed the jurre/helper-sub-process-timeout branch from ad5d950 to dbb517d Compare December 16, 2024 23:56
@kbukum1 kbukum1 marked this pull request as ready for review December 17, 2024 00:06
@kbukum1 kbukum1 requested a review from a team as a code owner December 17, 2024 00:06
)
.returns(T.nilable(T.any(String, T::Hash[String, T.untyped], T::Array[T::Hash[String, T.untyped]])))
end
def self.run_helper_subprocess(command:, function:, args:, env: nil,
stderr_to_stdout: false,
allow_unsafe_shell_command: false,
error_class: HelperSubprocessFailed)
error_class: HelperSubprocessFailed,
command_type: :network,
Copy link
Contributor Author

@kbukum1 kbukum1 Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Tip: We set the default timeout to 120 seconds (using :network).

Note: The timeout applies to the time between outputs to stdout or stderr, not the overall execution duration of the command. Each time the command produces output, the timeout is reset, and it waits for the next response.

Copy link
Member

@jonabc jonabc Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH this feels more complex/confusing than it needs to be. This introduces magic symbols that callers must know to use, where those symbols are mapped to timeout values in the new command helpers class.

There's also some complexity coming from the addition of both command type and timeout arguments. As a caller if I didn't know the internals of this function I would be confused that a command that was run with timeout: -1 would have a timeout applied because command_type: :network has a special meaning. Is timeout an explicit value? An implicit value? I need to dig into the code to know how to call this.

WDYT about creating constants for different common timeout values and having callers use those constants to set a timeout value when needed. This removes having two params, makes the timeout field explicit, and maintains named common timeout values.

module CommandTimeouts
   NETWORK = 120 # seconds
   LONG_RUNNING = 300 # seconds
   NONE = -1 # seconds
   LOCAL_TASK = 30 # seconds

   DEFAULT = NETWORK
end

def self.run_helper_subprocess(... timeout: CommandTimeouts::DEFAULT...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fixed. Please let me know if you see any issue now.

Copy link
Contributor

@pavera pavera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave myself a crash course on Ruby IO stream handling, this appears correct to me based on the limited info 20 minutes of reading can provide.

Copy link
Member

@landongrindheim landongrindheim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable to me. Feature flagging definitely seems like the right call here. It'd be great to see an incremental rollout to make sure everything is functioning as intended.

Comment on lines 117 to 118
# Use IO.select with a dynamically calculated short timeout
ready_ios = IO.select(ios, nil, nil, [0.1, remaining_timeout].min)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never worked with IO.select before, but the docs suggest the arguments should be Arrays of IO objects. What's the effect of using nil instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not monitoring for write readiness (nil) because writing is already completed at the beginning, and we do not perform incremental writes. Similarly, we do not monitor error conditions (nil) as they are not required in this context. For read readiness, we monitor the output streams (stdout and stderr) using a timeout to avoid blocking indefinitely. The timeout is set to the shorter of 0.1 seconds or the remaining timeout ([0.1, remaining_timeout].min), ensuring responsiveness while respecting the overall timeout limit.

Process.kill("KILL", pid) # Forcefully kill if still running
end
rescue Errno::EPERM
Dependabot.logger.error("Insufficient permissions to terminate process: #{pid}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in this case? Will the job still terminate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In natural way if it terminates (old way) we don't do anything. If it doesn't then the operation is going to get canceled after hanging out as how it was before. But generally we should have permission to do it since we initiated the command in this thread.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be surprised if this ever triggers TBH. This is only ever going to terminate a sub-process, and AFAIK the parent process should always have perms to terminate a child process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I aggree.

Comment on lines 56 to 61
DEFAULT_TIMEOUTS = T.let({
no_time_out: -1, # No timeout
local: 30, # Local commands
network: 120, # Network-dependent commands
long_running: 300 # Long-running tasks (e.g., builds)
}.freeze, T::Hash[T.untyped, T.untyped])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to keep these as T::Hash[T.untyped, T.untyped]? We're controlling them and all the values suggest to me that T::Hash[Symbol, Integer] would be the signature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Will fix it.

let(:output_hang_cmd) { command_fixture("output_hang.sh") }
let(:error_hang_cmd) { command_fixture("error_hang.sh") }
let(:invalid_cmd) { "non_existent_command" }
let(:timeout) { 2 } # Timeout for hanging commands
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@jonabc jonabc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one comment where I think the command api could be simplified a bit, otherwise looks alright 👍

)
.returns(T.nilable(T.any(String, T::Hash[String, T.untyped], T::Array[T::Hash[String, T.untyped]])))
end
def self.run_helper_subprocess(command:, function:, args:, env: nil,
stderr_to_stdout: false,
allow_unsafe_shell_command: false,
error_class: HelperSubprocessFailed)
error_class: HelperSubprocessFailed,
command_type: :network,
Copy link
Member

@jonabc jonabc Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH this feels more complex/confusing than it needs to be. This introduces magic symbols that callers must know to use, where those symbols are mapped to timeout values in the new command helpers class.

There's also some complexity coming from the addition of both command type and timeout arguments. As a caller if I didn't know the internals of this function I would be confused that a command that was run with timeout: -1 would have a timeout applied because command_type: :network has a special meaning. Is timeout an explicit value? An implicit value? I need to dig into the code to know how to call this.

WDYT about creating constants for different common timeout values and having callers use those constants to set a timeout value when needed. This removes having two params, makes the timeout field explicit, and maintains named common timeout values.

module CommandTimeouts
   NETWORK = 120 # seconds
   LONG_RUNNING = 300 # seconds
   NONE = -1 # seconds
   LOCAL_TASK = 30 # seconds

   DEFAULT = NETWORK
end

def self.run_helper_subprocess(... timeout: CommandTimeouts::DEFAULT...)

@kbukum1 kbukum1 force-pushed the jurre/helper-sub-process-timeout branch from b6aef5f to 78c2014 Compare December 17, 2024 22:14
reduce timeout parameter complexity
@kbukum1 kbukum1 requested a review from jonabc December 17, 2024 23:26
Copy link
Member

@jonabc jonabc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thanks for the updates! LGTM when CI is happy 😄

@kbukum1
Copy link
Contributor Author

kbukum1 commented Dec 17, 2024

👍 thanks for the updates! LGTM when CI is happy 😄

Thanks for the thorough review! I'll merge once the CI is happy.

@kbukum1 kbukum1 merged commit b43db8c into main Dec 18, 2024
127 checks passed
@kbukum1 kbukum1 deleted the jurre/helper-sub-process-timeout branch December 18, 2024 01:58
@abdulapopoola abdulapopoola changed the title WIP: Set timeout for helper subprocesses to enhance stability Set timeout for helper subprocesses to enhance stability Dec 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants